{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Model Selection - 1\n",
    "- **k-fold cross-validation** using ```scikit-learn wrapper```\n",
    "- **grid search** using ```scikit-learn wrapper```\n",
    "- **random search** using ```scikit-learn wrapper```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## **k-fold cross validation**\n",
    "- One of the most common method for model validation and selection\n",
    "    - Partition training dataset into ```k``` subsets and choose one of partitions as **validation set** and other remaining subset as **training set**\n",
    "    - Then, train model using **training set** and validate using **validation set**\n",
    "    - Average validation results of ```k``` rounds of partitions and training/validation\n",
    "    - Compare the results\n",
    "    \n",
    "<img src=\"https://upload.wikimedia.org/wikipedia/commons/1/1c/K-fold_cross_validation_EN.jpg\" style=\"width: 500px\"/>"
   ]
  },
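  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The partitioning steps above can be sketched in plain Python. This is a minimal illustration, not scikit-learn's implementation: ```kfold_indices``` is a hypothetical helper, and it assumes no shuffling, with the earliest folds taking one extra sample when the data does not divide evenly.\n",
    "\n",
    "```python\n",
    "def kfold_indices(n_samples, k):\n",
    "    # Split range(n_samples) into k contiguous folds (no shuffling);\n",
    "    # the first n_samples % k folds get one extra sample.\n",
    "    base, extra = divmod(n_samples, k)\n",
    "    folds, start = [], 0\n",
    "    for i in range(k):\n",
    "        stop = start + base + (1 if i < extra else 0)\n",
    "        folds.append(list(range(start, stop)))\n",
    "        start = stop\n",
    "    # Each round validates on one fold and trains on all the others.\n",
    "    splits = []\n",
    "    for i in range(k):\n",
    "        val_idx = folds[i]\n",
    "        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]\n",
    "        splits.append((train_idx, val_idx))\n",
    "    return splits\n",
    "\n",
    "splits = kfold_indices(10, 5)\n",
    "for round_no, (train_idx, val_idx) in enumerate(splits, start=1):\n",
    "    print(round_no, train_idx, val_idx)\n",
    "```\n",
    "\n",
    "Every sample lands in exactly one validation fold, so each round is scored on data the model did not see during that round's training."
   ]
  },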
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## **Grid search & Random search**\n",
    "- **Grid search** and **random search** are two methods for **hyperparameter tuning**\n",
    "    - **Grid search** tries all possible combinations of hyperparameter values specified\n",
    "    - **Random search** implements *randomized search over parameters*, in which each trial is a sample from possible hyperparameter distributions\n",
    "    \n",
    "<img src=\"https://cdn-images-1.medium.com/max/1600/1*ZTlQm_WRcrNqL-nLnx6GJA.png\" style=\"width: 500px\"/>"
   ]
  },
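  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The difference between the two strategies can be sketched with the standard library alone. The value lists mirror the grid used later in this notebook; ```sample_config```, the seed, and the sampling ranges are illustrative assumptions, not scikit-learn's API.\n",
    "\n",
    "```python\n",
    "import itertools\n",
    "import random\n",
    "\n",
    "grid = {\n",
    "    'embed_dim': [100, 300],\n",
    "    'lstm': [True, False],\n",
    "    'lr': [0.001, 0.01],\n",
    "    'batch_size': [64, 128, 256],\n",
    "}\n",
    "\n",
    "# Grid search: every combination of the listed values (2 * 2 * 2 * 3 = 24).\n",
    "names = list(grid)\n",
    "grid_configs = [dict(zip(names, values))\n",
    "                for values in itertools.product(*grid.values())]\n",
    "\n",
    "def sample_config(rng):\n",
    "    # Random search: draw each hyperparameter from a range or a list.\n",
    "    return {\n",
    "        'embed_dim': rng.randint(100, 299),\n",
    "        'lstm': rng.choice([True, False]),\n",
    "        'lr': rng.uniform(0.001, 0.101),\n",
    "        'batch_size': rng.randint(64, 255),\n",
    "    }\n",
    "\n",
    "rng = random.Random(0)  # fixed seed, chosen arbitrarily\n",
    "random_configs = [sample_config(rng) for _ in range(20)]\n",
    "\n",
    "print(len(grid_configs), len(random_configs))\n",
    "```\n",
    "\n",
    "The grid grows multiplicatively with every added value, while the cost of random search is set directly by the number of trials."
   ]
  },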
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load dataset\n",
    "- ```imdb``` dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "from keras.datasets import imdb\n",
    "from keras.preprocessing import sequence"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words = 5000)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2494\n",
      "2315\n",
      "11\n",
      "7\n"
     ]
    }
   ],
   "source": [
    "# printing out maximum & minimun length of sentences\n",
    "print(len(max(X_train, key = len)))\n",
    "print(len(max(X_test, key = len)))\n",
    "print(len(min(X_train, key = len)))\n",
    "print(len(min(X_test, key = len)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "X_train = sequence.pad_sequences(X_train, maxlen = 500)\n",
    "X_test = sequence.pad_sequences(X_test, maxlen = 500)"
   ]
  },
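  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With its default arguments, ```pad_sequences``` left-pads short sequences with zeros and keeps only the last ```maxlen``` tokens of long ones. A plain-Python sketch of that behavior, using a hypothetical helper ```pad_pre```:\n",
    "\n",
    "```python\n",
    "def pad_pre(seq, maxlen, value=0):\n",
    "    # Keep the last maxlen items, then left-pad with value up to maxlen.\n",
    "    kept = seq[-maxlen:]\n",
    "    return [value] * (maxlen - len(kept)) + kept\n",
    "\n",
    "short = pad_pre([1, 2, 3], 5)\n",
    "long = pad_pre([1, 2, 3, 4, 5, 6], 4)\n",
    "print(short)\n",
    "print(long)\n",
    "```\n",
    "\n",
    "Since the longest training review has 2494 tokens, ```maxlen = 500``` drops the beginning of long reviews and keeps their endings."
   ]
  },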
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(25000, 500)\n",
      "(25000, 500)\n",
      "(25000,)\n",
      "(25000,)\n"
     ]
    }
   ],
   "source": [
    "print(X_train.shape)\n",
    "print(X_test.shape)\n",
    "print(y_train.shape)\n",
    "print(y_test.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- When using ```Sequential``` model API, ```scikit-learn wrapper``` can be used\n",
    "- Define function to create model, and pass on to ```KerasClassifier```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from keras.models import Sequential\n",
    "from keras.layers import *\n",
    "from keras import optimizers\n",
    "from keras.wrappers.scikit_learn import KerasClassifier\n",
    "from sklearn.model_selection import KFold, cross_val_score"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def imdb_model(embed_dim = 100, lstm = True, lr = 0.01):\n",
    "    model = Sequential()\n",
    "    model.add(Embedding(5000, embed_dim))\n",
    "    if lstm:\n",
    "        model.add(CuDNNLSTM(100))\n",
    "    else:\n",
    "        model.add(CuDNNGRU(100))\n",
    "            \n",
    "    model.add(Dense(100))\n",
    "    model.add(Activation('relu'))\n",
    "    model.add(Dropout(0.3))\n",
    "    model.add(Dense(100))\n",
    "    model.add(Activation('relu'))\n",
    "    model.add(Dropout(0.3))\n",
    "    model.add(Dense(1))\n",
    "    model.add(Activation('sigmoid'))\n",
    "    \n",
    "    adam = optimizers.adam(lr = lr)\n",
    "    model.compile(loss = 'binary_crossentropy', optimizer = adam, metrics = ['accuracy'])\n",
    "    return model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "model = KerasClassifier(build_fn = imdb_model, epochs = 10, batch_size = 128, verbose = 1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### ```k-fold cross-validation``` using ```scikit-learn wrapper```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# set k = 5\n",
    "results = cross_val_score(model, X_train, y_train, cv = 5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Cross validation Results: \n",
      "1th round accuracy: 0.8576\n",
      "2th round accuracy: 0.8768\n",
      "3th round accuracy: 0.8642\n",
      "4th round accuracy: 0.8698\n",
      "5th round accuracy: 0.8654\n",
      "Average accuracy:  0.86676\n",
      "Standard deviation:  0.00636163500996\n"
     ]
    }
   ],
   "source": [
    "# print out results\n",
    "# in most cases, average accuracy and standard deviation are meaningful metrics\n",
    "print('Cross validation Results: ')\n",
    "for i in range(len(results)):\n",
    "    print('{}th round accuracy: {}'.format(i+1, results[i]))\n",
    "print('Average accuracy: ', results.mean())\n",
    "print('Standard deviation: ', results.std())"
   ]
  },
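  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "NumPy's ```std()``` defaults to the population standard deviation (```ddof = 0```). As a rough check, the summary above can be reproduced from the five fold accuracies with the standard library alone; the sample standard deviation is shown for comparison.\n",
    "\n",
    "```python\n",
    "import statistics\n",
    "\n",
    "# Per-fold accuracies copied from the output above.\n",
    "fold_acc = [0.8576, 0.8768, 0.8642, 0.8698, 0.8654]\n",
    "\n",
    "mean_acc = statistics.mean(fold_acc)\n",
    "pop_std = statistics.pstdev(fold_acc)    # divides by n, like numpy's default\n",
    "sample_std = statistics.stdev(fold_acc)  # divides by n - 1\n",
    "\n",
    "print(mean_acc, pop_std, sample_std)\n",
    "```\n",
    "\n",
    "With only five folds the two estimates differ noticeably, so it helps to state which one is being reported."
   ]
  },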
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### ```grid search``` using ```scikit-learn wrapper```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection import GridSearchCV"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
    "# first define hyperparameter grid\n",
    "embed_dim = [100, 300]\n",
    "lstm = [True, False]\n",
    "lr = [0.001, 0.01]\n",
    "batch_size = [64, 128, 256]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [],
   "source": [
    "hyperparam_grid = {'embed_dim': embed_dim, 'lstm': lstm, 'lr': lr, 'batch_size': batch_size}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = KerasClassifier(build_fn = imdb_model, epochs = 10, verbose = 1)\n",
    "clf = GridSearchCV(estimator = model, param_grid = hyperparam_grid)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "grid_result = clf.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [],
   "source": [
    "means = clf.cv_results_['mean_test_score']\n",
    "stds = clf.cv_results_['std_test_score']\n",
    "params = clf.cv_results_['params']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Best test accuracy:  0.867440000012\n",
      "Standard Deviation of Accuracies:  0.00621257665123\n",
      "Parameter Setting:  {'batch_size': 64, 'embed_dim': 300, 'lr': 0.001, 'lstm': False}\n"
     ]
    }
   ],
   "source": [
    "# displaying best results & parameter settings\n",
    "max_idx = np.argmax(means)\n",
    "print('Best test accuracy: ', means[max_idx])\n",
    "print('Standard Deviation of Accuracies: ', stds[max_idx])\n",
    "print('Parameter Setting: ', params[max_idx])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### ```random search``` using ```scikit-learn wrapper```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection import RandomizedSearchCV\n",
    "from scipy.stats import randint, uniform\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "embed_dim = randint(100, 300)\n",
    "lstm = [True, False]\n",
    "lr = uniform(0.001, 0.1)\n",
    "batch_size = randint(64, 256)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "hyperparam_dist = {'embed_dim': embed_dim, 'lstm': lstm, 'lr': lr, 'batch_size': batch_size}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "model = KerasClassifier(build_fn = imdb_model, epochs = 10, verbose = 1)\n",
    "clf = RandomizedSearchCV(estimator = model, param_distributions = hyperparam_dist, n_iter = 20)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "random_result = clf.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "means = clf.cv_results_['mean_test_score']\n",
    "stds = clf.cv_results_['std_test_score']\n",
    "params = clf.cv_results_['params']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Best test accuracy:  0.861520008204\n",
      "Standard Deviation of Accuracies:  0.00517948234862\n",
      "Parameter Setting:  {'batch_size': 153, 'embed_dim': 121, 'lr': 0.0052380206595373227, 'lstm': True}\n"
     ]
    }
   ],
   "source": [
    "# displaying best results & parameter settings\n",
    "max_idx = np.argmax(means)\n",
    "print('Best test accuracy: ', means[max_idx])\n",
    "print('Standard Deviation of Accuracies: ', stds[max_idx])\n",
    "print('Parameter Setting: ', params[max_idx])"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
